Abstract. The inclusion of a macroscopic adaptive threshold is studied for the retrieval dynamics of layered feedforward neural network models with synaptic noise. It is shown that if the threshold is chosen appropriately as a function of the cross-talk noise and of the activity of the stored patterns, adapting itself automatically in the course of the recall process, an autonomous functioning of the network is guaranteed. This self-control mechanism considerably improves the quality of retrieval, in particular the storage capacity, the basins of attraction and the mutual information content.



1 Introduction 

Layered feedforward neural network models are by now the workhorses of many practical applications in several areas of research, and any new insight into their capabilities and limitations is therefore welcome. In many of these applications, e.g., pattern recognition in general, information is mostly encoded by a small fraction of bits, and neurophysiological studies also find the activity level of real neurons to be low. Any reasonable network model therefore has to allow variable activity of the neurons. The limit of low activity, i.e., sparse coding, is then especially interesting. Indeed, sparsely coded models have a very large storage capacity behaving as $1/(a \ln a)$ for small $a$, where $a$ is the activity (see, e.g., [1, 2, 3, 4] and references therein). However, for low activity the basins of attraction might become very small and the information content in a single pattern is reduced [4]. Therefore, the necessity of a control of the activity of the neurons has been
emphasized such that the latter stays the same as the activity of the stored 
patterns during the recall process. This has led to several discussions imposing 
external constraints on the dynamics of the network. However, the enforcement of 
such a constraint at every time step destroys part of the autonomous functioning 
of the network, i.e., a functioning that has to be independent precisely from such 
external constraints or control mechanisms. To solve this problem, quite recently a self-control mechanism has been introduced in the dynamics of networks for so-called diluted architectures. This self-control mechanism introduces a time-dependent threshold in the transfer function [5, 6]. It is determined as a function of both the cross-talk noise and the activity of the stored patterns in the network, and adapts itself in the course of the recall process. It furthermore allows the network to reach optimal retrieval performance both in the absence and in the presence of synaptic noise [5, 6, 7, 8]. These diluted architectures contain no common ancestor nodes, in contrast with feedforward architectures. It has since been shown that a similar mechanism can be introduced successfully for layered feedforward architectures, but without synaptic noise [9].

The purpose of the present contribution is to generalise this self-control mechanism for layered architectures when synaptic noise is allowed, and to show that it leads to a substantial improvement of the quality of retrieval, in particular the storage capacity, the basins of attraction and the mutual information content.



2 The model 

Consider a neural network composed of binary neurons arranged in layers, each layer containing $N$ neurons. A neuron can take values $\sigma_i(t) \in \{0, 1\}$ where $t = 1, \ldots, L$ is the layer index and $i = 1, \ldots, N$ labels the neurons. Each neuron on layer $t$ is unidirectionally connected to all neurons on layer $t + 1$. We want to memorize $p$ patterns $\{\xi_i^\mu(t)\}$, $i = 1, \ldots, N$, $\mu = 1, \ldots, p$ on each layer $t$, taking the values $\{0, 1\}$. They are assumed to be independent identically distributed random variables (i.i.d.r.v.) with respect to $i$, $\mu$ and $t$, determined by the probability distribution $p(\xi_i^\mu(t)) = a\,\delta(\xi_i^\mu(t) - 1) + (1 - a)\,\delta(\xi_i^\mu(t))$. From this form we find that the expectation value and the second moment of the patterns are given by $E[\xi_i^\mu(t)] = E[\xi_i^\mu(t)^2] = a$, so that their variance is $a(1 - a)$. Moreover, no statistical correlations occur; in fact, for $\mu \neq \nu$ the covariance vanishes.

The state $\sigma_i(t + 1)$ of neuron $i$ on layer $t + 1$ is determined by the state of the neurons on the previous layer $t$ according to the stochastic rule

$$P(\sigma_i(t+1) \mid \sigma_1(t), \ldots, \sigma_N(t)) = \{1 + \exp[-2(2\sigma_i(t+1) - 1)\,\beta h_i(t)]\}^{-1}. \tag{1}$$

The right-hand side is the logistic function. The "temperature" $T = 1/\beta$ controls the stochasticity of the network dynamics; it measures the synaptic noise level [10]. Given the network state $\{\sigma_i(t)\}$, $i = 1, \ldots, N$ on layer $t$, the so-called "local field" $h_i(t)$ of neuron $i$ on the next layer $t + 1$ is given by

$$h_i(t) = \sum_{j=1}^{N} J_{ij}(t)\,(\sigma_j(t) - a) - \theta(t) \tag{2}$$

with $\theta(t)$ the threshold to be specified later. The couplings $J_{ij}(t)$ are the synaptic strengths of the interaction between neuron $j$ on layer $t$ and neuron $i$ on layer $t + 1$. They depend on the stored patterns at different layers according to the covariance rule

$$J_{ij}(t) = \frac{1}{N a(1-a)} \sum_{\mu=1}^{p} (\xi_i^\mu(t+1) - a)(\xi_j^\mu(t) - a). \tag{3}$$



These couplings then permit one to store sets of patterns to be retrieved by the layered network.
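As an illustration of how the covariance rule and the local field fit together, the following minimal sketch builds the couplings between two consecutive layers and evaluates the local fields. It assumes the normalisation $1/(N a(1-a))$ for the couplings; the function names and the toy patterns are hypothetical and chosen only for this example.

```python
def covariance_couplings(xi_next, xi_prev, a):
    """Couplings J[i][j] from neuron j on layer t to neuron i on layer t + 1.

    xi_prev[mu][j] and xi_next[mu][i] are the 0/1 patterns stored on layers
    t and t + 1; a is the pattern activity. The prefactor 1 / (N a (1 - a))
    is an assumption of this sketch.
    """
    n = len(xi_prev[0])
    norm = 1.0 / (n * a * (1.0 - a))
    return [
        [
            norm * sum((xn[i] - a) * (xp[j] - a) for xn, xp in zip(xi_next, xi_prev))
            for j in range(n)
        ]
        for i in range(len(xi_next[0]))
    ]


def local_field(couplings, sigma, a, theta=0.0):
    """Local fields h_i = sum_j J_ij (sigma_j - a) - theta on the next layer."""
    return [
        sum(j_ij * (s_j - a) for j_ij, s_j in zip(row, sigma)) - theta
        for row in couplings
    ]


# Toy example: a single stored pattern, N = 4 neurons, activity a = 0.5.
xi_prev = [[1, 0, 1, 0]]  # pattern on layer t
xi_next = [[1, 1, 0, 0]]  # pattern on layer t + 1
J = covariance_couplings(xi_next, xi_prev, a=0.5)
h = local_field(J, sigma=xi_prev[0], a=0.5)
```

When the state on layer $t$ coincides with the stored pattern, each field reduces to $\xi_i(t+1) - a$ in this toy case, so active sites of the next pattern receive a positive field and inactive sites a negative one.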

The dynamics of this network is defined as follows (see [11]). Initially the first layer (the input) is externally set in some fixed state. In response to that, all neurons of the second layer update synchronously at the next time step according to the stochastic rule (1), and so on for the subsequent layers.
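The layer-by-layer stochastic update can be sketched as follows. The sketch assumes the sign convention in which a positive local field favours the active state $\sigma = 1$, so that the logistic rule is equivalent to $P(\sigma = 1) = [1 + \tanh(\beta h)]/2$; the function names are hypothetical.

```python
import math
import random


def prob_state(sigma_next, h, beta):
    """Probability that a neuron takes the value sigma_next (0 or 1) given its field h.

    Written with tanh, which equals the logistic form
    1 / (1 + exp(-2 (2 sigma - 1) beta h)) but cannot overflow.
    """
    return 0.5 * (1.0 + math.tanh((2 * sigma_next - 1) * beta * h))


def update_layer(fields, beta, rng):
    """Draw the 0/1 states of the next layer; all neurons are updated in parallel."""
    return [1 if rng.random() < prob_state(1, h, beta) else 0 for h in fields]


rng = random.Random(0)  # fixed seed, used only to make the sketch reproducible
states = update_layer([0.5, 0.5, -0.5, -0.5], beta=10.0, rng=rng)
```

For large $\beta$ (low temperature) the update approaches the deterministic Heaviside rule, while for small $\beta$ each neuron fires with probability close to one half regardless of its field.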

At this point we remark that the couplings $J_{ij}(t)$ are of infinite range (each neuron interacts with infinitely many others) such that our model allows a so-called mean-field theory approximation. This essentially means that we focus on the dynamics of a single neuron while replacing all the other neurons by an average background local field. In other words, no fluctuations of the other neurons are taken into account. In our case this approximation becomes exact because, crudely speaking, $h_i(t)$ is the sum of very many terms and a central limit theorem can be applied [10].

It is standard knowledge by now that mean-field theory dynamics can be solved exactly for these layered architectures (e.g., [11, 12]). By exact analytic
treatment we mean that, given the state of the first layer as initial state, the state 
on layer t that results from the dynamics is predicted by recursion formulas. This 
is essentially due to the fact that the representations of the patterns on different 
layers are chosen independently. Hence, the big advantage is that this will allow 
us to determine the effects from self-control in an exact way. 

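The relevant parameters describing the solution of this dynamics are the main overlap of the state of the network and the $\mu$-th pattern, and the neural activity of the neurons

$$M^\mu(t) = \frac{1}{N a(1-a)} \sum_{i=1}^{N} (\xi_i^\mu(t) - a)(\sigma_i(t) - a), \qquad q(t) = \frac{1}{N} \sum_{i=1}^{N} \sigma_i(t). \tag{4}$$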
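For a finite network these two order parameters can be measured directly from a state and a pattern. The sketch below assumes the overlap is normalised by $N a(1-a)$, so that perfect retrieval gives $M = 1$, and that the activity is the fraction of active neurons; the function name is hypothetical.

```python
def overlap_and_activity(xi, sigma, a):
    """Return (M, q) for a 0/1 pattern xi and a 0/1 network state sigma.

    M = sum_i (xi_i - a)(sigma_i - a) / (N a (1 - a))   (assumed normalisation)
    q = sum_i sigma_i / N
    """
    n = len(sigma)
    m = sum((x - a) * (s - a) for x, s in zip(xi, sigma)) / (n * a * (1.0 - a))
    q = sum(sigma) / n
    return m, q


pattern = [1, 0, 1, 0]
m_perfect, q_perfect = overlap_and_activity(pattern, pattern, a=0.5)
m_silent, q_silent = overlap_and_activity(pattern, [0, 0, 0, 0], a=0.5)
```

A state equal to the pattern has full overlap and the pattern's own activity, whereas the silent state carries no overlap and zero activity.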



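In order to measure the retrieval quality of the recall process, we use the mutual information function [5, 13, 14]. In general, it measures the average amount of information that can be received by the user by observing the signal at the output of a channel [15, 16]. For the recall process of stored patterns that we are discussing here, at each layer the process can be regarded as a channel with input $\xi_i^\mu(t)$ and output $\sigma_i(t)$ such that this mutual information function can be defined as [5, 15]

$$I(\sigma_i(t); \xi_i^\mu(t)) = S(\sigma_i(t)) - \langle S(\sigma_i(t) \mid \xi_i^\mu(t)) \rangle_{\xi^\mu(t)} \tag{5}$$

where $S(\sigma_i(t))$ and $S(\sigma_i(t) \mid \xi_i^\mu(t))$ are the entropy and the conditional entropy of the output, respectively

$$S(\sigma_i(t)) = -\sum_{\sigma_i} p(\sigma_i(t)) \ln[p(\sigma_i(t))] \tag{6}$$

$$S(\sigma_i(t) \mid \xi_i^\mu(t)) = -\sum_{\sigma_i} p(\sigma_i(t) \mid \xi_i^\mu(t)) \ln[p(\sigma_i(t) \mid \xi_i^\mu(t))]. \tag{7}$$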
These information entropies are peculiar to the probability distributions of the output. The quantity $p(\sigma_i(t))$ denotes the probability distribution for the neurons at layer $t$ and $p(\sigma_i(t) \mid \xi_i^\mu(t))$ indicates the conditional probability that the $i$-th neuron is in a state $\sigma_i(t)$ at layer $t$ given that the $i$-th site of the pattern to be retrieved is $\xi_i^\mu(t)$. Hereby, we have assumed that the conditional probability of all the neurons factorizes, i.e., $p(\{\sigma_i(t)\} \mid \{\xi_i(t)\}) = \prod_j p(\sigma_j(t) \mid \xi_j(t))$, which is a consequence of the mean-field theory character of our model explained above. We remark that a similar factorization has also been used in Schwenker et al. [17].

The calculation of the different terms in the expression (5) proceeds as follows. Because of the mean-field character of our model the following formulas hold for every neuron $i$ on each layer $t$. Formally writing (forgetting about the pattern index $\mu$) $\langle O \rangle \equiv \langle \langle O \rangle_{\sigma \mid \xi} \rangle_\xi = \sum_\xi p(\xi) \sum_\sigma p(\sigma \mid \xi)\, O$ for an arbitrary quantity $O$, the conditional probability can be obtained in a rather straightforward way by using the complete knowledge about the system: $\langle \xi \rangle = a$, $\langle \sigma \rangle = q$, $\langle (\sigma - a)(\xi - a) \rangle = a(1-a)M$, $\langle 1 \rangle = 1$.

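The result reads

$$p(\sigma \mid \xi) = [\gamma_0 + (\gamma_1 - \gamma_0)\xi]\,\delta(\sigma - 1) + [1 - \gamma_0 - (\gamma_1 - \gamma_0)\xi]\,\delta(\sigma) \tag{8}$$

where $\gamma_0 = q - aM$ and $\gamma_1 = (1 - a)M + q$, and where the $M$ and $q$ are precisely the relevant parameters (4) for large $N$. Using the probability distribution of the patterns we obtain

$$p(\sigma) = q\,\delta(\sigma - 1) + (1 - q)\,\delta(\sigma). \tag{9}$$

Hence the entropy (6) and the conditional entropy (7) become

$$S(\sigma) = -q \ln q - (1 - q) \ln(1 - q) \tag{10}$$

$$S(\sigma \mid \xi) = -[\gamma_0 + (\gamma_1 - \gamma_0)\xi] \ln[\gamma_0 + (\gamma_1 - \gamma_0)\xi] - [1 - \gamma_0 - (\gamma_1 - \gamma_0)\xi] \ln[1 - \gamma_0 - (\gamma_1 - \gamma_0)\xi]. \tag{11}$$

By averaging the conditional entropy over the pattern $\xi$ we finally get for the mutual information function (5) for the layered model

$$I(\sigma; \xi) = -q \ln q - (1 - q) \ln(1 - q) + a\,[\gamma_1 \ln \gamma_1 + (1 - \gamma_1) \ln(1 - \gamma_1)] + (1 - a)\,[\gamma_0 \ln \gamma_0 + (1 - \gamma_0) \ln(1 - \gamma_0)]. \tag{12}$$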
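Eq. (12) is straightforward to evaluate numerically once $M$, $q$ and $a$ are known. The following sketch is one possible implementation, with the convention $x \ln x = 0$ at $x = 0$; the helper names are hypothetical.

```python
import math


def xlogx(x):
    """x * ln(x), using the limiting value 0 for x <= 0 (guards rounding at the edges)."""
    return x * math.log(x) if x > 0.0 else 0.0


def binary_entropy(p):
    """Entropy (in nats) of a 0/1 variable that equals 1 with probability p."""
    return -xlogx(p) - xlogx(1.0 - p)


def mutual_information(m, q, a):
    """Mutual information per neuron of Eq. (12) for overlap m, activity q, pattern activity a."""
    gamma0 = q - a * m            # P(sigma = 1 | xi = 0)
    gamma1 = (1.0 - a) * m + q    # P(sigma = 1 | xi = 1)
    return binary_entropy(q) - a * binary_entropy(gamma1) - (1.0 - a) * binary_entropy(gamma0)


a = 0.005
info_perfect = mutual_information(m=1.0, q=a, a=a)  # perfect retrieval
info_none = mutual_information(m=0.0, q=a, a=a)     # state uncorrelated with the pattern
```

Two limiting cases serve as a check: for perfect retrieval ($M = 1$, $q = a$) the conditional entropy vanishes and the information equals the entropy of a single pattern bit, while for $M = 0$ the output is independent of the pattern and the information is zero.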

3 Adaptive thresholds 

It is standard knowledge that the synchronous dynamics for layered architectures can be solved exactly following the method based upon a signal-to-noise analysis of the local field (2) (e.g., [4, 12, 18, 19] and references therein). Without loss of generality we focus on the recall of one pattern, say $\mu = 1$, meaning that only $M^1(t)$ is macroscopic, i.e., of order 1, and the rest of the patterns cause a cross-talk noise at each step of the dynamics.



We suppose that the initial state of the network model $\{\sigma_i(1)\}$ is a collection of i.i.d.r.v. with mean and second moment given by $E[\sigma_i(1)] = E[(\sigma_i(1))^2] = q_0$. We furthermore assume that this state is correlated with only one stored pattern, say pattern $\mu = 1$, such that $\mathrm{Cov}(\xi_i^\mu(1), \sigma_i(1)) = \delta_{\mu,1}\, M_0\, a(1 - a)$.

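Then the full recall process is described by [11, 12]

$$M^1(t+1) = \frac{1}{2} \left\{ \int \mathcal{D}x\, \tanh\left[\beta\left((1-a)M^1(t) - \theta(t) + \sqrt{\alpha D(t)}\, x\right)\right] - \int \mathcal{D}x\, \tanh\left[\beta\left(-aM^1(t) - \theta(t) + \sqrt{\alpha D(t)}\, x\right)\right] \right\} \tag{13}$$

$$q(t+1) = aM^1(t+1) + \frac{1}{2} \left\{ 1 + \int \mathcal{D}x\, \tanh\left[\beta\left(-aM^1(t) - \theta(t) + \sqrt{\alpha D(t)}\, x\right)\right] \right\} \tag{14}$$

$$D(t+1) = Q(t+1) + \left\{ \frac{\beta}{2} \left[ 1 - a \int \mathcal{D}x\, \tanh^2\left[\beta\left((1-a)M^1(t) - \theta(t) + \sqrt{\alpha D(t)}\, x\right)\right] - (1-a) \int \mathcal{D}x\, \tanh^2\left[\beta\left(-aM^1(t) - \theta(t) + \sqrt{\alpha D(t)}\, x\right)\right] \right] \right\}^2 D(t) \tag{15}$$

where $\alpha = p/N$, $\mathcal{D}x$ is the Gaussian measure $\mathcal{D}x = dx\,(2\pi)^{-1/2} \exp(-x^2/2)$, where $Q(t) = (1 - 2a)q(t) + a^2$ and where $D(t)$ contains the influence of the cross-talk noise caused by the patterns $\mu > 1$. As mentioned before, $\theta(t)$ is an adaptive threshold that has to be chosen.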
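A single layer-to-layer step of the recursion (13)-(15) can be sketched numerically as below. The sketch reads the feedback term of Eq. (15) as the square of $(\beta/2)[1 - \langle \tanh^2 \rangle]$ times $D(t)$, approximates the Gaussian averages by a trapezoidal rule on $[-8, 8]$, and takes $D(1) = Q(1)$ as initial condition; these choices, like the function names, are assumptions of the example and not taken from the text.

```python
import math


def gauss_average(f, half_width=8.0, steps=4000):
    """Trapezoidal approximation of the average of f(x) over x ~ N(0, 1)."""
    dx = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        x = -half_width + k * dx
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * f(x) * math.exp(-0.5 * x * x)
    return total * dx / math.sqrt(2.0 * math.pi)


def recursion_step(m, d, theta, a, alpha, beta):
    """Map (M(t), D(t)) to (M(t+1), q(t+1), D(t+1)) for a given threshold theta."""
    spread = math.sqrt(alpha * d)  # standard deviation of the cross-talk noise

    def on(x):   # tanh argument at sites where the pattern bit is 1
        return math.tanh(beta * ((1.0 - a) * m - theta + spread * x))

    def off(x):  # tanh argument at sites where the pattern bit is 0
        return math.tanh(beta * (-a * m - theta + spread * x))

    t_on, t_off = gauss_average(on), gauss_average(off)
    t2_on = gauss_average(lambda x: on(x) ** 2)
    t2_off = gauss_average(lambda x: off(x) ** 2)

    m_new = 0.5 * (t_on - t_off)                           # Eq. (13)
    q_new = a * m_new + 0.5 * (1.0 + t_off)                # Eq. (14)
    big_q = (1.0 - 2.0 * a) * q_new + a * a                # Q(t + 1)
    chi = 0.5 * beta * (1.0 - a * t2_on - (1.0 - a) * t2_off)
    d_new = big_q + chi * chi * d                          # Eq. (15), assumed reading
    return m_new, q_new, d_new


a, alpha, beta = 0.005, 1.0, 10.0        # activity, loading, inverse temperature
m, q = 1.0, a                            # start on the pattern: M_0 = 1, q_0 = a
d = (1.0 - 2.0 * a) * q + a * a          # assumed initial condition D(1) = Q(1)
for _ in range(5):
    theta = 0.3                          # fixed threshold, chosen only for illustration
    m, q, d = recursion_step(m, d, theta, a, alpha, beta)
```

Keeping the threshold as an explicit argument makes it easy to plug in either a layer-independent value or a threshold that is recomputed from $D(t)$ at every layer.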

In the sequel we discuss two different choices and both will be compared for networks with synaptic noise and various activities. It is known that the quality of the recall process is influenced by the cross-talk noise. An idea is then to introduce a threshold that adapts itself autonomously in the course of the recall process and that counters, at each layer, the cross-talk noise. This is the self-control method proposed in [5]. It has been studied for layered neural network models without synaptic noise, i.e., at $T = 0$, where the rule (1) reduces to the deterministic form $\sigma_i(t + 1) = \Theta(h_i(t))$ with $\Theta(x)$ the Heaviside function taking the values $\{0, 1\}$. For sparsely coded models, meaning that the pattern activity $a$ is very small and tends to zero for $N$ large, it has been found [5] that



$$\theta(t)_{sc} = c(a)\sqrt{\alpha D(t)}, \qquad c(a) = \sqrt{-2 \ln a} \tag{16}$$

makes the second term on the r.h.s. of Eq. (14), at $T = 0$, asymptotically vanish faster than $a$ such that $q \sim a$. It turns out that the inclusion of this self-control threshold considerably improves the quality of retrieval, in particular the storage capacity, the basins of attraction and the information content.

The second approach chooses a threshold by maximizing the information content $i = \alpha I$ of the network (recall Eq. (12)). This function depends on $M^1(t)$, $q(t)$, $a$, $\alpha$ and $\beta$. The evolution of $M^1(t)$ and of $q(t)$, Eqs. (13) and (14), depends on the specific choice of the threshold through the local field (2). We consider a layer-independent threshold $\theta(t) = \theta$ and calculate the value of (12) for fixed $a$, $\alpha$, $M_0$, $q_0$ and $\beta$. The optimal threshold, $\theta = \theta_{opt}$, is then the one for which the mutual information function is maximal. Finding it is non-trivial because it is rather difficult, especially in the limit of sparse coding, even to choose a threshold interval by hand such that $i$ is non-zero. The computational cost will thus be larger than that of the self-control approach. To illustrate this we plot in Figure 1 the information content $i$ as a function of $\theta$ without self-control or a priori optimization, for $a = 0.005$ and different values of $\alpha$. For every value of $\alpha$ below its critical value, there is a range for the threshold where the information content is different from zero and hence retrieval is possible. This retrieval range becomes very small when the storage capacity approaches its critical value $\alpha_c = 6.4$.

Fig. 1. The information $i = \alpha I$ as a function of $\theta$ for $a = 0.005$, $T = 0.1$ and several values of the load parameter $\alpha = 0.1, 1, 2, 4, 6$ (bottom to top).

Concerning then the self-control approach, the next problem to be posed, in analogy with the case without synaptic noise, is the following one. Can one determine a form for the threshold $\theta(t)$ such that the integral in the second term on the r.h.s. of Eq. (14) at $T \neq 0$ vanishes asymptotically faster than $a$?

In contrast with the case at zero temperature where, due to the simple form of the transfer function, this threshold could be determined analytically (recall Eq. (16)), a detailed study of the asymptotics of the integral in Eq. (14) gives no satisfactory analytic solution. Therefore, we have designed a systematic numerical procedure through the following steps:

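- Choose a small value for the activity $a'$.

- Determine through numerical integration the threshold $\theta'$ such that

$$\int_{-\infty}^{\infty} \frac{dx\, e^{-x^2/2\sigma^2}}{\sigma\sqrt{2\pi}}\, \Theta(x - \theta) < a' \quad \text{for } \theta > \theta' \tag{17}$$

for different values of the variance $\sigma^2 = \alpha D(t)$.

- Determine, as a function of $T = 1/\beta$, the value for $\theta'_T$ such that

$$\int_{-\infty}^{\infty} \frac{dx\, e^{-x^2/2\sigma^2}}{2\sigma\sqrt{2\pi}}\, \left[1 + \tanh[\beta(x - \theta)]\right] < a' \quad \text{for } \theta > \theta' + \theta'_T. \tag{18}$$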
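The steps above can be sketched numerically as follows. This is only an illustration of the idea, not the procedure actually used by the authors: the zero-temperature integral is evaluated with the complementary error function, the finite-temperature integral with a trapezoidal rule, and the boundary thresholds are located by bisection. All function names and parameter values are hypothetical.

```python
import math


def gaussian_tail(theta, sigma):
    """Zero-temperature integral: probability that x ~ N(0, sigma^2) exceeds theta."""
    return 0.5 * math.erfc(theta / (sigma * math.sqrt(2.0)))


def noisy_tail(theta, sigma, beta, steps=4000):
    """Finite-temperature integral: Gaussian average of (1 + tanh[beta (x - theta)]) / 2."""
    lo, hi = -10.0 * sigma, 10.0 * sigma
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        weight = 0.5 if k in (0, steps) else 1.0
        gauss = math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += weight * gauss * 0.5 * (1.0 + math.tanh(beta * (x - theta)))
    return total * dx


def boundary_threshold(tail, target, lo=0.0, hi=50.0, tol=1e-10):
    """Bisection for the threshold above which a decreasing tail(theta) stays below target."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tail(mid) > target:
            lo = mid
        else:
            hi = mid
    return hi


a_prime, sigma, beta = 0.005, 1.0, 5.0  # illustrative values, with T = 1 / beta = 0.2
theta_zero = boundary_threshold(lambda th: gaussian_tail(th, sigma), a_prime)
theta_noisy = boundary_threshold(lambda th: noisy_tail(th, sigma, beta), a_prime)
theta_shift = theta_noisy - theta_zero  # extra threshold needed at temperature T
```

In this sketch the threshold $\sqrt{-2 \ln a'}\,\sigma$ lies above the boundary value found in the zero-temperature step, so it satisfies the inequality with some margin, and synaptic noise pushes the required threshold upwards.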



The second step leads precisely to a threshold having the form of Eq. (16). The third step, determining the temperature-dependent part $\theta'_T$, leads to the final proposal

$$\theta_t(a, T) = \sqrt{-2 \ln(a)\, \alpha D(t)} - \frac{1}{2} \ln(a)\, T^2. \tag{19}$$

This dynamical threshold is again a macroscopic parameter, thus no average must be taken over the microscopic random variables at each step $t$ of the recall process.
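Because the threshold depends only on macroscopic quantities, it can be evaluated in a few lines at every layer. The sketch below assumes the proposal has the form $\theta_t(a, T) = \sqrt{-2 \ln(a)\, \alpha D(t)} - \tfrac{1}{2} \ln(a)\, T^2$; the coefficient $1/2$ of the temperature term is our reading of the formula and should be treated as an assumption, and the function names are hypothetical.

```python
import math


def theta_self_control(a, alpha, d):
    """Zero-temperature self-control threshold sqrt(-2 ln(a) * alpha * D(t))."""
    return math.sqrt(-2.0 * math.log(a) * alpha * d)


def theta_noise_corrected(a, alpha, d, temperature):
    """Self-control threshold with the temperature-dependent correction.

    The coefficient 0.5 in front of ln(a) T^2 is an assumption of this sketch.
    """
    return theta_self_control(a, alpha, d) - 0.5 * math.log(a) * temperature ** 2


a = math.exp(-2.0)  # convenient value: ln(a) = -2
theta_cold = theta_noise_corrected(a, alpha=1.0, d=1.0, temperature=0.0)
theta_warm = theta_noise_corrected(a, alpha=1.0, d=1.0, temperature=1.0)
```

Since $\ln a < 0$ for $a < 1$, the correction always raises the threshold, and it grows both with the temperature and with the sparseness of the coding.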

We have solved these self-controlled dynamics, Eqs. (13)-(15) and (19), numerically for our model with synaptic noise in the limit of sparse coding. In particular, we have studied in detail the influence of the $T$-dependent part of the threshold. Of course, we are only interested in the retrieval solutions with $M > 0$ (we forget about the index 1) and carrying a non-zero information $i = \alpha I$. The important features of the solution are illustrated, for a typical value of $a$, in Figures 2-5. In Figure 2 we show the basin of attraction for the whole retrieval phase for the model with threshold (16) (dashed curves) compared to the model with the noise-dependent threshold (19) (full curves). We see that there is no clear improvement for low $T$ but there is a substantial one for higher $T$. Even near the border of critical storage the results are still improved such that also the storage capacity itself is larger.

Fig. 2. The basin of attraction as a function of $\alpha$ for $a = 0.005$ and $T = 0.2, 0.15, 0.1, 0.05$ (from left to right) with (full lines) and without (dashed lines) the $T$-dependent part in the threshold (19).

This is further illustrated in Figure 3 where we compare the evolution of the retrieval overlap $M(t)$ starting from several initial values, $M_0$, for the self-control model with and without the $T$-correction in the threshold (Figures 3(a) and 3(b)) and for the optimal threshold model (Figure 3(c)). Here this temperature correction is absolutely crucial to guarantee retrieval, i.e., $M \approx 1$. It really makes the difference between retrieval and non-retrieval in the model. Furthermore, the model with the self-control threshold with noise-correction has an even wider basin of attraction than the model with optimal threshold.

Fig. 3. The evolution of the main overlap $M(t)$ for several initial values $M_0$ with $T = 0.2$, $q_0 = a = 0.005$, $\alpha = 1$ for the self-control model (19) without (a) and with $T$-dependent part (b) and for the optimal threshold model (c).

In Figure 4 we plot the information content $i$ as a function of the temperature for the self-control dynamics with the threshold (19) (full curves) and with the threshold (16) (dashed curves). We see that a substantial improvement of the information content is obtained.

Fig. 4. The information content $i = \alpha I$ as a function of $T$ for several values of the loading $\alpha$ and $a = 0.005$ with (full lines) and without (dashed lines) the $T$-correction in the threshold.

Finally we show in Figure 5 a $T$-$\alpha$ plot for $a = 0.005$ (a) and $a = 0.02$ (b) with (full line) and without (dashed line) noise-correction in the self-control threshold and with optimal threshold (dotted line). These lines indicate two phases of the layered model: below the lines our model allows recall, above the lines it does not. For $a = 0.005$ we see that the $T$-dependent term in the self-control threshold leads to a big improvement in the region of large noise and small loading and in the region of critical loading. For $a = 0.02$ the results for the self-control threshold with and without noise-correction and those for the optimal thresholds almost coincide, but we recall that the calculation with self-control is done autonomously by the network and is less demanding computationally.

Fig. 5. Phases in the $T$-$\alpha$ plane for $a = 0.005$ (a) and $a = 0.02$ (b) with (full line) and without (dashed line) the temperature correction in the self-control threshold and with optimal threshold (dotted line).

4 Conclusions 

In this work we have studied the inclusion of an adaptive threshold in sparsely coded layered neural networks with synaptic noise. We have presented an analytic form for a self-control threshold, allowing an autonomous functioning of the network, and compared it with an optimal threshold obtained by maximizing the mutual information, which has to be calculated externally each time one of the network parameters (activity, loading, temperature) is changed. The consequences of this self-control mechanism on the quality of the recall process have been studied.

We find that the basins of attraction of the retrieval solutions as well as the storage capacity are enlarged. For some activities the self-control threshold even sets the border between retrieval and non-retrieval. This confirms the considerable improvement of the quality of recall by self-control, also for layered network models with synaptic noise.



This allows us to conjecture that self-control might be relevant for other architectures in the presence of synaptic noise, and even for dynamical systems in general, when trying to improve, e.g., basins of attraction.